STA 214 Codebook
Use: The purpose of this codebook is to summarize key code that we use in R to visualize and explore data, fit models, assess assumptions, and test hypotheses. I will update the codebook periodically throughout the semester as we learn new topics.
Many tasks in R, such as visualizing data, can be done in multiple
different ways. Throughout this codebook I will tend to use the
tidyverse for visualizing and cleaning data, but be aware
that other options exist.
Installing packages
Many functions and datasets we want to use in R are provided in
packages. An important package for STA 214 will be the
tidyverse package, which is really a collection of R
packages with a common philosophy on how to work with data.
To install tidyverse, run the following code in your R
console (not in an R Markdown document!):
install.packages("tidyverse")
To install other packages, simply replace the name
tidyverse with the name of the package you want to install.
Note that packages only ever need to be installed once. After a
package is successfully installed, you won’t need to install it again
unless you update R on your computer.
Loading packages
Installing a package downloads it to your computer, but doesn’t make
it immediately available to your R session. If you want to use the
tidyverse package, you will need to load it by
running the following code:
library(tidyverse)
(The code is similar for other packages, just replace the name
tidyverse). You will need to include
library(...) calls in the setup chunk of your R Markdown
documents.
Exploratory data analysis (EDA)
In STA 112, you learned tools for exploratory data analysis (EDA), in which we explore features of the data such as the available variables and their relationships. Exploratory data analysis is an important step before we do any model fitting or hypothesis testing, because it gets us familiar with the data and any unusual features. It is hard to fit a sensible model when we don’t know what the data look like!
This section of the codebook includes examples of summarizing the
data, identifying missing values, and univariate and multivariate EDA.
The data used for this section is the penguins dataset,
from the palmerpenguins package. To load the
penguins data, run the following in R (you may need to
install the palmerpenguins package):
library(palmerpenguins)
We can then look at the description of the penguins data
provided by the package:
?penguins
(this only works for datasets included in packages).
Data size and variables
To peek at the data, we can use the glimpse function to
see the variables and the size:
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
From this summary, we can see there are 344 rows (each row is one penguin), 8 columns, and we see the first few values in each column. Another way to explore the raw data in RStudio is to run
View(penguins)
in your console (not in R Markdown!).
Other functions can also be used to calculate the number of rows, the number of columns, and both the number of rows and number of columns:
nrow(penguins) # number of rows
## [1] 344
ncol(penguins) # number of columns
## [1] 8
dim(penguins) # dimensions of the data
## [1] 344 8
Missing data
Sometimes, our data contains missing values. These are often recorded
as NA in R (for “not available”). You can see when we
glimpsed the penguins data above that there were several
missing values (NAs), and there are probably more in the
rest of the data that wasn’t displayed.
To remove rows which contain missing values, we can use the
drop_na function. The following code creates a new
dataset (called penguins_new) without those missing
values:
penguins_new <- penguins %>%
drop_na()
How many rows did we remove (i.e., how many rows had missing values)?
We can compare the dimensions of penguins_new to
penguins:
dim(penguins_new)
## [1] 333 8
Since penguins_new has 333 rows, we have removed 11 rows
due to missing values.
Removing missing values from specific columns
What if we only want to remove rows with NAs in the
flipper_length_mm column? We can specify that column in the
drop_na function:
penguins_new <- penguins %>%
drop_na(flipper_length_mm)
dim(penguins_new)
## [1] 342 8
So, there were 2 rows with missing flipper length values.
A note on pipes (%>% and |>)
Throughout this codebook (and throughout my own code too!), I use the
pipe operator %>%. The pipe just means “take THIS, then
do THAT”. So,
penguins %>%
drop_na(flipper_length_mm)
means “take penguins, then drop_na (remove
missing values)”. Pipes can be chained together into a longer sequence
of steps, too:
penguins %>%
drop_na(flipper_length_mm) %>%
dim()
## [1] 342 8
I like pipes because I think they often make code cleaner and more readable, but you don’t have to use them. You can collapse the pipe into a nested series of functions (evaluated from the inside out):
dim(drop_na(penguins, flipper_length_mm))
## [1] 342 8
Or you can save the intermediate steps:
penguins_new <- drop_na(penguins, flipper_length_mm)
dim(penguins_new)
## [1] 342 8
The pipe operator %>% comes from the
magrittr package (the name is a nod to artist Rene Magritte
and his painting The Treachery of Images). Newer versions of R
also include a built-in pipe operator |> which functions
similarly:
penguins |>
drop_na(flipper_length_mm) |>
dim()
## [1] 342 8
For simplicity, and because |> is only available in more
recent versions of R, in this course I will default to the older pipe %>%.
Types of variables
Here we will deal mainly with categorical and quantitative variables.
- Categorical variables are variables which take on
one of several fixed values. These values generally do not have a
numeric interpretation.
- Examples: gender, favorite food, brand of laptop
- Binary categorical variables have exactly 2 possible values
- Quantitative variables are variables which take on
a numeric value, and which have a numeric interpretation.
- Examples: number of pets, height, weight, age
- Discrete quantitative variables only take on discrete values (e.g., number of pets)
- Continuous quantitative variables can take on an entire range of values (e.g., height is continuous if we allow heights like 60.323 inches)
An overview of ggplot for making plots
The ggplot2 package, which is part of
tidyverse, is a very valuable tool for visualizing data.
The ggplot2 package provides the ggplot
function, which is my default for making plots. For example, here is
code to create a scatterplot:
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point() +
labs(x = 'Bill depth (mm)',
y = 'Bill length (mm)',
color = 'Species',
title = "Penguin bill length vs. bill depth") +
theme_bw()
## Warning: Removed 2 rows containing missing values (geom_point).
Notice that ggplot is warning us that we had some
missing values in the data. If we remove those missing values, the
warning will go away.
ggplot layers
The idea behind ggplot is to build the figure in layers.
We start off by specifying the data that we want to use
(penguins), and the variables that we want to plot
(bill_depth_mm, bill_length_mm, and
species). These variables are mapped to the
aesthetics of the plot: features like the x-axis
(x = bill_depth_mm), the y-axis
(y = bill_length_mm), and the color of the points
(color = species). We specify the aesthetics inside the
aes function. Other aesthetics include shape,
fill, alpha, etc.
Running the first few lines sets up the plot for us to fill in:
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species))
Next, we need to decide how to represent the observations in our
plot. Here we have two quantitative variables, so a scatterplot is
reasonable. We therefore add another layer to our plot, representing
each observation as a point (notice that we add layers together with the
+ sign):
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
The way we represent our data (in this case, points) is with
geometric objects, or geoms. If we want points, we use
geom_point. Other geoms include geom_histogram
(histograms), geom_bar (bar charts),
geom_boxplot (boxplots), among many others. The sections
below give an overview of some of the most important plots.
Finally, we want to make our plot look nice. This involves changing
the axis labels and the legend, adding a title, and changing the theme.
We include axis labels in the labs layer, and we change the
theme in the theme layer. Possible themes include
theme_bw, theme_minimal,
theme_classic, theme_light, and many others. I
typically use theme_bw, but you may choose whichever theme
you prefer.
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point() +
labs(x = 'Bill depth (mm)',
y = 'Bill length (mm)',
color = 'Species',
title = "Penguin bill length vs. bill depth") +
theme_bw()
## Warning: Removed 2 rows containing missing values (geom_point).
Univariate exploratory data analysis
Here we discuss how to summarize, visualize, and describe the distributions of categorical and quantitative variables.
Categorical variables
Summarize
The number of observations in each group can be summarized in a frequency table:
penguins %>%
count(species)
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
Visualize
The same information can be visualized with bar charts, which display the number of observations in each group as the height of the bar:
penguins %>%
ggplot(aes(x = species)) +
geom_bar() +
labs(x = "Species") +
theme_bw()
Other visualization options include pie charts.
Describe
- Which category has the most observations? The least?
- Are observations spread relatively evenly across categories, or do one or two categories have the majority of the observations?
Quantitative variables
Summarize
Many summary statistics can be calculated for quantitative variables. We often calculate the mean or median to summarize the center of the distribution, and the standard deviation or IQR to summarize the spread of the distribution. If the data are highly skewed, the median and IQR are often more appropriate measures of center and spread.
Note that if NAs (missing values) are present in the data, then we
need to remove them before calculating summary statistics. This can be
done by removing all rows with NAs (drop_na()), or by
ignoring NAs when we calculate the summary statistics
(na.rm=TRUE).
penguins %>%
summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
median_mass = median(body_mass_g, na.rm=TRUE),
sd_mass = sd(body_mass_g, na.rm=TRUE),
iqr_mass = IQR(body_mass_g, na.rm=TRUE))
## # A tibble: 1 × 4
## mean_mass median_mass sd_mass iqr_mass
## <dbl> <dbl> <dbl> <dbl>
## 1 4202. 4050 802. 1200
Visualize
A good choice for visualizing the distribution of a quantitative
variable is the histogram. A histogram divides the
range of the data into evenly spaced bins, and then displays the number
of observations which fall into each bin. Since the number of bins
affects how the histogram looks, it is good practice to experiment with
several different numbers of bins. This can be specified with the
bins argument in geom_histogram.
penguins %>%
ggplot(aes(x = body_mass_g)) +
geom_histogram(bins = 20) +
labs(x = "Body mass (g)") +
theme_bw()
Another common option for visualization is the boxplot. A boxplot doesn’t show the whole distribution, but rather a summary of it. In particular, it displays the median, first and third quartiles, the smallest and largest non-outlier values, and any outliers.
penguins %>%
ggplot(aes(y = body_mass_g)) +
geom_boxplot() +
labs(y = "Body mass (g)") +
theme_bw()
Other tools include density plots (geom_density) and
violin plots (geom_violin).
Describe
- Shape (symmetric vs. skewed, number of modes, location of modes)
- Center (usually mean or median)
- Spread (usually standard deviation or IQR)
- Any unusual features?
- Any potential outliers?
Bivariate exploratory data analysis
What if we want to look at the relationship between two variables?
Two categorical variables
Summarize
We can count the number of observations in each group:
penguins %>%
count(species, island)
## # A tibble: 5 × 3
## species island n
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
Sometimes, it is nice to display the result as a two-way table, where categories for one variable are in the rows, and categories for the second variable are in the columns:
penguins %>%
count(species, island) %>%
spread(island, n)
## # A tibble: 3 × 4
## species Biscoe Dream Torgersen
## <fct> <int> <int> <int>
## 1 Adelie 44 56 52
## 2 Chinstrap NA 68 NA
## 3 Gentoo 124 NA NA
(Note that here, NA means that this combination of values did not appear in the dataset. So, e.g., there were no Chinstrap penguins from Biscoe island).
Visualize
A common way to visualize the relationship between two categorical variables is with a stacked bar graph:
penguins %>%
ggplot(aes(x = species, fill = island)) +
geom_bar() +
labs(x = "Species",
fill = "Island") +
theme_bw()
Other options include mosaic plots.
Describe
- Which combination of categories has the most observations? The least?
- Are there any combinations which do not appear in the data?
- Is the distribution for the second variable the same for each level
of the first variable? (E.g., in the penguins example
above, there appears to be a relationship between species and island, because the distribution across islands is different for the three species. Adelie penguins are found on all three islands, whereas Chinstrap and Gentoo penguins are each only on one).
Two quantitative variables
Visualize
To visualize the relationship between two quantitative variables, we can use a scatterplot:
penguins %>%
ggplot(aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point() +
labs(x = "Flipper length (mm)",
y = "Body mass (g)") +
theme_bw()
Summarize
If the relationship looks linear, we can calculate the sample correlation coefficient, \(r\), to summarize the strength of the linear relationship. Recall that \(r\) takes values between -1 and 1, with \(r = -1\) a very strong negative relationship, \(r = 0\) no relationship, and \(r = 1\) a very strong positive relationship.
When calculating the correlation, we have to handle NAs if missing
values are present in the data. This can be done either by removing all
rows with NAs beforehand (drop_na()), or by ignoring NAs
when computing the correlation (use = "complete.obs").
penguins %>%
summarize(r = cor(flipper_length_mm,
body_mass_g,
use="complete.obs"))
## # A tibble: 1 × 1
## r
## <dbl>
## 1 0.871
Describe
- Does there appear to be a relationship?
- If so, does the relationship appear to be positive or negative?
- What is the general shape of the relationship? Does it look linear?
- If the relationship looks linear, report the sample correlation coefficient
One categorical, one quantitative
Visualize
There are several options for visualizing the relationship between a categorical and a quantitative variable. A common choice is to make a boxplot for each level of the categorical variable:
penguins %>%
ggplot(aes(x = species, y = body_mass_g)) +
geom_boxplot() +
labs(x = "Species", y = "Body mass (g)") +
theme_bw()
While boxplots are just summaries of a distribution, they are very handy for comparing across groups.
Another option, if the number of categories isn’t too large, is to create a histogram faceted by the categorical variable:
penguins %>%
ggplot(aes(x = body_mass_g)) +
geom_histogram(bins = 20) +
facet_wrap(~species) +
labs(x = "Body mass (g)") +
theme_bw()
Summarize
To summarize the relationship, we can calculate summary statistics
for the quantitative variable at each level of the categorical variable.
The group_by function is very helpful here.
penguins %>%
group_by(species) %>%
summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
median_mass = median(body_mass_g, na.rm=TRUE))
## # A tibble: 3 × 3
## species mean_mass median_mass
## <fct> <dbl> <dbl>
## 1 Adelie 3701. 3700
## 2 Chinstrap 3733. 3700
## 3 Gentoo 5076. 5000
Describe
Is the distribution of the quantitative variable different across levels of the categorical variable? If so, how? (e.g., differences in shape, center, spread)
More than two variables
With more than two variables, you can get a lot of combinations. Here are just a couple examples. Using additional aesthetics and faceting is helpful for visualization. Using grouping is helpful for summary statistics.
Quantitative, quantitative, categorical
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = body_mass_g,
color = species)) +
geom_point() +
labs(x = "Bill depth (mm)",
y = "Body mass (g)",
color = "Species") +
theme_bw()
penguins %>%
group_by(species) %>%
summarize(r = cor(bill_depth_mm,
body_mass_g,
use="complete.obs"))
## # A tibble: 3 × 2
## species r
## <fct> <dbl>
## 1 Adelie 0.576
## 2 Chinstrap 0.604
## 3 Gentoo 0.719
Quantitative, categorical, categorical
penguins %>%
ggplot(aes(x = island,
y = body_mass_g)) +
geom_boxplot() +
facet_wrap(~species) +
labs(x = "Island",
y = "Body mass (g)") +
theme_bw()
penguins %>%
ggplot(aes(x = body_mass_g)) +
geom_histogram(bins=15) +
facet_grid(island~species) +
labs(x = "Body mass (g)") +
theme_bw()
penguins %>%
group_by(species, island) %>%
summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
median_mass = median(body_mass_g, na.rm=TRUE))
## # A tibble: 5 × 4
## # Groups: species [3]
## species island mean_mass median_mass
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Biscoe 3710. 3750
## 2 Adelie Dream 3688. 3575
## 3 Adelie Torgersen 3706. 3700
## 4 Chinstrap Dream 3733. 3700
## 5 Gentoo Biscoe 5076. 5000
Logistic regression
In this section, we will continue to use the penguins
data from the palmerpenguins package (see the EDA section
above for instructions on loading the data).
Two components of a parametric model
In the penguins data, sex is a categorical
variable which is recorded as either male or female. Suppose we want to
predict penguin sex based on physical characteristics like flipper
length and body mass. Let \(Y_i = 0\)
if sex = female, and \(Y_i = 1\) if sex
= male (I am coding male as 1 here because, by default, R orders the
levels of sex alphabetically: female, then male).
Then we might propose the following population logistic
regression model:
\[Y_i \sim Bernoulli(\pi_i)\] \[\log \left( \dfrac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 FlipperLength_i + \beta_2 BodyMass_i\] Here \(\pi_i\) denotes the probability that \(Y_i = 1\), and the log odds are a linear function of flipper length and body mass.
The first line of this model is called the random component: it specifies the distribution of the random variable \(Y_i\). The second line is called the systematic component: it specifies how \(\pi_i\) is related to the explanatory variables.
Fitting logistic regression models
To fit the above logistic regression model in R, we use the
glm function:
m1 <- glm(sex ~ flipper_length_mm + body_mass_g,
data = penguins, family = binomial)
summary(m1)
##
## Call:
## glm(formula = sex ~ flipper_length_mm + body_mass_g, family = binomial,
## data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9901 -0.9598 0.2442 0.8898 2.1068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.0868039 2.6621921 2.662 0.00777 **
## flipper_length_mm -0.0932568 0.0199664 -4.671 3.00e-06 ***
## body_mass_g 0.0027820 0.0003923 7.092 1.32e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 461.61 on 332 degrees of freedom
## Residual deviance: 371.98 on 330 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 377.98
##
## Number of Fisher Scoring iterations: 4
The glm function stands for generalized linear
model. We specify the response variable on the left hand side of
the ~, and the explanatory variables on the right hand
side. To fit logistic regression, we specify
family = binomial.
The summary function summarizes our fitted model. We can
see that \(\widehat{\beta}_0 = 7.087\),
\(\widehat{\beta}_1 = -0.093\), and
\(\widehat{\beta}_2 = 0.003\). Our
fitted logistic regression model is therefore
\[Y_i \sim Bernoulli(\pi_i)\] \[\log \left( \dfrac{\widehat{\pi}_i}{1 - \widehat{\pi}_i} \right) = 7.087 -0.093 FlipperLength_i + 0.003 BodyMass_i\]
Note that the equation of the fitted model includes hats on the \(\pi_i\).
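To go from the fitted log odds back to an estimated probability, we invert the logit: \(\widehat{\pi}_i = \exp\{\eta_i\}/(1 + \exp\{\eta_i\})\), where \(\eta_i\) is the fitted log odds. Here is a sketch using the rounded coefficients above and a hypothetical penguin with 190 mm flippers and a body mass of 3750 g (these values are made up for illustration):

```r
# Fitted log odds for a hypothetical penguin with flipper length
# 190 mm and body mass 3750 g, using the rounded coefficients above
eta <- 7.087 - 0.093 * 190 + 0.003 * 3750

# plogis() computes exp(eta) / (1 + exp(eta)), converting the
# log odds into an estimated probability that the penguin is male
pi_hat <- plogis(eta)
pi_hat
```

With an actual fitted model object like m1, the predict function with type = "response" does this conversion for you.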
Interpreting regression coefficients
The fitted coefficients above can be interpreted on the log odds scale, or on the odds scale.
Log odds:
- The estimated log odds that a penguin is male when flipper length and body mass are both 0 is 7.087. (Since all penguins have flippers and no penguins weigh 0g, this interpretation doesn’t mean much)
- A one mm increase in flipper length is associated with an estimated decrease in the log odds that a penguin is male by 0.093, holding body mass constant
- A one g increase in body mass is associated with an estimated increase in the log odds that a penguin is male by 0.003, holding flipper length constant
Odds:
- The estimated odds that a penguin is male when flipper length and body mass are both 0 are \(\exp\{7.087\} = 1196\). (Since all penguins have flippers and no penguins weigh 0g, this interpretation doesn’t mean much)
- A one mm increase in flipper length is associated with an estimated decrease in the odds that a penguin is male by a factor of \(\exp\{-0.093\} = 0.911\), holding body mass constant
- A one g increase in body mass is associated with an estimated increase in the odds that a penguin is male by a factor of \(\exp\{0.003\} = 1.003\), holding flipper length constant
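The odds-scale numbers above are just the exponentiated coefficients, so you can compute them directly in R. A sketch using the rounded estimates from the summary output (with the fitted model object m1, exp(coef(m1)) gives the same quantities without rounding):

```r
# Exponentiate the rounded coefficient estimates to move
# from the log odds scale to the odds scale
exp(7.087)   # baseline odds when both predictors are 0 (about 1196)
exp(-0.093)  # factor change in odds per 1 mm of flipper length (about 0.911)
exp(0.003)   # factor change in odds per 1 g of body mass (about 1.003)
```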
Empirical logit plots
The following R function can be used to create empirical logit plots:
logodds_plot <- function(data, num_bins, bin_method,
x_name, y_name, grouping = NULL,
reg_formula = y ~ x){
if(is.null(grouping)){
dat <- data.frame(x = data %>% pull(x_name),
y = data %>% pull(y_name),
group = 1)
} else {
dat <- data.frame(x = data %>% pull(x_name),
y = data %>% pull(y_name),
group = as.factor(data %>% pull(grouping)))
}
if(bin_method == "equal_size"){
logodds_table <- dat %>%
drop_na() %>%
arrange(group, x) %>%
group_by(group) %>%
mutate(obs = y,
bin = rep(1:num_bins,
each=ceiling(n()/num_bins))[1:n()]) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
} else {
logodds_table <- dat %>%
drop_na() %>%
group_by(group) %>%
mutate(obs = y,
bin = cut(x,
breaks = num_bins,
labels = F)) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
}
if(is.null(grouping)){
logodds_table %>%
ggplot(aes(x = mean_x,
y = logodds)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds") +
theme(text = element_text(size=15))
} else {
logodds_table %>%
ggplot(aes(x = mean_x,
y = logodds,
color = group,
shape = group)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds",
color = grouping,
shape = grouping) +
theme(text = element_text(size=15))
}
}
Arguments:
- data: the dataset of interest (e.g., titanic)
- num_bins: the number of bins to use
  - The number of bins should be chosen based on the size of the data. For example, with bin_method = "equal_size", we probably want at least 15 observations per bin, so num_bins < (number of observations)/15
- bin_method: whether to choose bins with the same number of observations ("equal_size") or the same width ("equal_width")
- x_name: the name of the column containing the explanatory variable (e.g., "Fare"). The quotation marks are needed
- y_name: the name of the column containing the response (e.g., "Survived"). The quotation marks are needed
- grouping: the name of a categorical variable in the data; fit a separate line for each level of the categorical variable
- reg_formula: the shape of the relationship you want to plot. If you want a line, this is y ~ x (the default). Some other examples:
  - y ~ log(x): a log transformation of x
  - y ~ sqrt(x): a square root transformation of x
  - y ~ poly(x, 2): a second-order polynomial
  - y ~ poly(x, 3): a third-order polynomial
Examples
The titanic dataset contains a number of variables
recorded for passengers on the Titanic when it sunk on its
first voyage. The response variable we care about with the
titanic data is Survived: whether a passenger
survived the disaster or not. Let’s first look at Fare as
an explanatory variable. We will make an empirical logit plot with 30
equally sized bins:
library(tidyverse)
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ x)
That linear fit doesn’t look very good! Maybe we should try a different shape. Let’s try a quadratic fit instead:
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ poly(x, 2))
And maybe a log transformation too:
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ log(x))
Quantile residual plots
Quantile residual plots can be created using the
qresid() function of the statmod package. For
example, suppose we fit a logistic regression on the
titanic data with Survival as the response and Fare as the
explanatory variable:
m1 <- glm(Survived ~ Fare, data = titanic, family = binomial)
Let’s assess whether the shape assumption is reasonable for Fare with a quantile residual plot:
library(statmod)
titanic %>%
mutate(residuals = qresid(m1)) %>%
ggplot(aes(x = Fare, y = residuals)) +
geom_point() +
geom_smooth() +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The results aren’t amazing…what if we try transforming Fare? Let’s try a square root:
m2 <- glm(Survived ~ sqrt(Fare), data = titanic, family = binomial)
titanic %>%
mutate(residuals = qresid(m2)) %>%
ggplot(aes(x = sqrt(Fare), y = residuals)) +
geom_point() +
geom_smooth() +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looks a bit better!
Poisson regression (and variants)
In this section, we will work with the PhD articles data
from class on March 3. Our data contains information on 915 biochemistry
PhD students, with variables
- art: articles published in last three years of Ph.D.
- fem: gender (recorded as male or female)
- mar: marital status (recorded as married or single)
- kid5: number of children under age six
- phd: prestige of Ph.D. program
- ment: articles published by their mentor in last three years
The data can be loaded by:
library(foreign)
articles <- read.dta("http://www.stata-press.com/data/lf2/couart2.dta")
Count variables and the Poisson distribution
Suppose we want to model the number of articles published by each
student. This is a count variable: it takes values 0, 1, 2, 3,
etc. (non-negative integers). Here is a histogram showing the
distribution of the number of articles published in the
articles data:
articles %>%
ggplot(aes(x = art)) +
geom_histogram(bins = 15) +
theme_bw() +
labs(x = "Number of articles published")
We can see that this is a unimodal, right-skewed count distribution. How should we model such a variable? One option is the Poisson distribution: we say that \(Y \sim Poisson(\lambda)\) if \[P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}\] The value of \(\lambda\) changes the Poisson distribution: \(\lambda\) is both the mean and variance of the distribution, so as \(\lambda\) increases, the distribution shifts right and becomes more spread out. Here are some plots showing samples from Poisson distributions with different values of \(\lambda\):
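As a quick sketch of the pmf formula above, we can check \(P(Y = y) = e^{-\lambda}\lambda^y/y!\) against R’s built-in dpois function (the values of \(\lambda\) and \(y\) here are arbitrary choices for illustration):

```r
# P(Y = 2) for Y ~ Poisson(lambda = 3), computed two ways
lambda <- 3
y <- 2

by_hand  <- exp(-lambda) * lambda^y / factorial(y)  # pmf formula directly
built_in <- dpois(y, lambda)                        # R's built-in Poisson pmf

# Both give about 0.224; recall that for a Poisson distribution
# the mean and the variance both equal lambda
c(by_hand, built_in)
```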
Fitting a Poisson regression model
Suppose that we want to model the number of articles published, using the number of articles published by a student’s mentor and the prestige of their PhD program as the explanatory variables. Our population model is
\[Articles_i \sim Poisson(\lambda_i)\]
\[\log(\lambda_i) = \beta_0 + \beta_1 Prestige_i + \beta_2 Mentor_i\] The random component specifies that we are using a Poisson distribution for the response, with parameter \(\lambda_i\). Our systematic component tells us that \(\lambda_i\) depends on prestige and the number of mentor articles. We use \(\log(\lambda_i)\) as our link function, because \(\lambda_i \in (0, \infty)\) and so \(\log(\lambda_i) \in (-\infty, \infty)\).
To fit this model in R, we use the glm function, and
specify family=poisson:
m_art <- glm(art ~ phd + ment, family = poisson, data = articles)
summary(m_art)
##
## Call:
## glm(formula = art ~ phd + ment, family = poisson, data = articles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4999 -1.6230 -0.3706 0.4921 5.9909
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.225505 0.085692 2.632 0.0085 **
## phd 0.011575 0.026367 0.439 0.6607
## ment 0.025850 0.001975 13.088 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1817.4 on 914 degrees of freedom
## Residual deviance: 1669.4 on 912 degrees of freedom
## AIC: 3343.1
##
## Number of Fisher Scoring iterations: 5
Interpreting coefficients
The coefficients of a Poisson regression model can be interpreted on the log scale, or on the original scale of the data. For example, in the model above, \(\widehat{\beta}_1 = 0.012\). This means that a one-unit increase in PhD prestige is associated with an increase of 0.012 in the log-mean number of articles published by PhD students, after accounting for the number of articles published by their mentor. Or, a one-unit increase in PhD prestige is associated with an increase in the mean number of articles published by a PhD student by a factor of \(\exp\{0.012\} = 1.012\), after accounting for the number of articles published by their mentor.
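The multiplicative factor is just the exponentiated coefficient, which is easy to compute directly. A sketch with the rounded estimates from the summary above (exp(coef(m_art)) would give the unrounded versions):

```r
# Factor change in the mean number of articles for a one-unit
# increase in each explanatory variable, from the rounded estimates
exp(0.012)  # PhD program prestige (about 1.012)
exp(0.026)  # mentor articles (about 1.026)
```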
Assumptions and diagnostics
The Poisson regression model makes several assumptions:
- Shape: the log mean is a linear function of the explanatory variables (we can use transformed explanatory variables if needed)
- Distribution: the Poisson distribution is a reasonable distribution for the response variable
- Independence: the observed data are independent (needed for maximum likelihood estimation)
To generally assess whether the proposed model is a good fit to the data, we can use a goodness-of-fit test. To explore the shape assumption, we can use empirical log means plots and quantile residual plots. To explore the distribution assumption, we can use a quantile residual plot.
Goodness-of-fit
The goodness-of-fit test tests the hypotheses
\[H_0: \text{ the model is a good fit to the data}\] \[H_A: \text{ the model is not a good fit to the data}\]
For a Poisson regression model, if the model is a good fit, then the residual deviance should follow a \(\chi^2_{n-p}\) distribution, where \(n\) is the number of observations in the data and \(p\) is the number of parameters in the model (including the intercept).
In the model above, the residual deviance is 1669.4 and \(n - p = 912\). Our p-value for the goodness-of-fit test is therefore
pchisq(1669.4, 912, lower.tail=F)
## [1] 4.093545e-47
Our p-value is very close to 0, so perhaps our model is not a good fit to the data. This could occur if we need to include more variables in the model, if our shape assumption is violated, or if the Poisson distribution was the wrong choice.
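The same computation can be done directly from a fitted model object with the deviance() and df.residual() helpers, so we don't have to copy numbers by hand. A sketch, using simulated Poisson data in place of the articles data:

```r
# Sketch: goodness-of-fit p-value computed from a fitted Poisson model.
# `fit` is a stand-in for any Poisson glm (like the model above).
set.seed(4)
gof_data <- data.frame(x = rnorm(100))
gof_data$y <- rpois(100, lambda = exp(0.1 + 0.2 * gof_data$x))
fit <- glm(y ~ x, family = poisson, data = gof_data)

# residual deviance compared to a chi-squared with n - p df
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```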
Important note: The goodness-of-fit test cannot be used with binary logistic regression, quasi-Poisson models, or negative binomial models. In this course, only the Poisson model allows the goodness-of-fit test.
Empirical log means plots
The following R function can be used to create empirical log means plots:
logmean_plot <- function(data, num_bins, bin_method,
x, y, grouping = NULL, reg_formula = y ~ x){
if(is.null(grouping)){
dat <- data.frame(x = data[,x],
y = data[,y],
group = 1)
} else {
dat <- data.frame(x = data[,x],
y = data[,y],
group = data[,grouping])
}
if(bin_method == "equal_size"){
log_table <- dat %>%
drop_na() %>%
arrange(group, x) %>%
group_by(group) %>%
mutate(obs = y,
bin = rep(1:num_bins,
each=ceiling(n()/num_bins))[1:n()]) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
mean_y = mean(obs),
num_obs = n()) %>%
ungroup() %>%
mutate(log_mean = log(mean_y))
} else {
log_table <- dat %>%
drop_na() %>%
group_by(group) %>%
mutate(obs = y,
bin = cut(x,
breaks = num_bins,
labels = F)) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
mean_y = mean(obs),
num_obs = n()) %>%
ungroup() %>%
mutate(log_mean = log(mean_y))
}
if(is.null(grouping)){
log_table %>%
ggplot(aes(x = mean_x,
y = log_mean)) +
geom_point() +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x,
y = "Empirical log mean count")
} else {
log_table %>%
ggplot(aes(x = mean_x,
y = log_mean,
color = group,
shape = group)) +
geom_point() +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x,
y = "Empirical log mean count",
color = grouping,
shape = grouping)
}
}
Arguments:
- data: the dataset of interest (e.g., titanic)
- num_bins: the number of bins to use. The number of bins should be chosen based on the size of the data. For example, with bin_method = "equal_size", we probably want at least 15 observations per bin, so num_bins < (number of observations)/15
- bin_method: whether to choose bins with the same number of observations ("equal_size") or the same width ("equal_width")
- x: the name of the column containing the explanatory variable (e.g., "phd"). The quotation marks are needed
- y: the name of the column containing the response (e.g., "art"). The quotation marks are needed
- grouping: the name of a categorical variable in the data; fit a separate line for each level of the categorical variable
- reg_formula: the shape of the relationship you want to plot. If you want a line, this is y ~ x (the default). Some other examples:
  - y ~ log(x): a log transformation of x
  - y ~ sqrt(x): a square root transformation of x
  - y ~ poly(x, 2): a second-order polynomial
  - y ~ poly(x, 3): a third-order polynomial
Examples:
logmean_plot(articles, 20, "equal_size", x="phd", y = "art")
We can see that the assumption that the log mean number of articles is a linear function of PhD prestige is reasonable. There is a fair bit of variability in the points, but that does not indicate a violation of the shape assumption.
Quantile residual plots
As with logistic regression, quantile residual plots can be used for assessing the shape assumption:
articles %>%
mutate(residuals = qresid(m_art)) %>%
ggplot(aes(x = phd, y = residuals)) +
geom_point() +
geom_smooth() +
theme_bw()## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
For count data, we are looking for three things in our quantile residual plots:
- Is there a pattern to the residuals? (i.e., does the trend line look sufficiently different from a horizontal line?) If so, we may have a violation of the shape assumption. On the other hand, if the residuals are randomly scattered above and below 0, then the shape assumption is reasonable.
- Are the residuals mostly between -2 and 2? If not, we may have overdispersion in our data. Overdispersion in a Poisson regression model suggests we may need a quasi-Poisson or negative binomial model instead
- Is the variance of the residuals constant for each value of the explanatory variable? If not, then we may have the wrong mean-variance relationship. Non-constant variance in the quantile residual plot for a Poisson regression model suggests we may need a negative binomial model instead
Overdispersion and quasi-Poisson models
The Poisson distribution has the same mean and variance ( \(\lambda\) ), so when we fit a Poisson model we are assuming the mean and variance of the response are the same. Sometimes, however, the variance is larger than the mean, which causes problems with hypothesis tests and confidence intervals.
The dispersion parameter is \(\phi = \dfrac{\text{Variance}}{\text{Mean}}\). The Poisson distribution assumes that \(\phi = 1\). Overdispersion occurs when \(\phi > 1\).
To allow for \(\phi \neq 1\), we can fit a quasi-Poisson model:
m_art_qp <- glm(art ~ phd + ment, family = quasipoisson, data = articles)
summary(m_art_qp)##
## Call:
## glm(formula = art ~ phd + ment, family = quasipoisson, data = articles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4999 -1.6230 -0.3706 0.4921 5.9909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.225505 0.117808 1.914 0.0559 .
## phd 0.011575 0.036250 0.319 0.7496
## ment 0.025850 0.002715 9.520 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.890036)
##
## Null deviance: 1817.4 on 914 degrees of freedom
## Residual deviance: 1669.4 on 912 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
The estimated dispersion parameter for this quasi-Poisson model is \(\widehat{\phi} = 1.89\), which means that the variance of our response is approximately 1.89 times the mean of our response.
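This estimate is the Pearson statistic divided by the residual degrees of freedom. A sketch of the computation, using simulated Poisson data so the code is self-contained (for the articles data, the same computation on m_art gives roughly 1.89):

```r
# Sketch: estimating the dispersion parameter from a fitted Poisson model
# as (Pearson statistic) / (residual degrees of freedom).
set.seed(2)
d <- data.frame(x = rnorm(300))
d$y <- rpois(300, lambda = exp(0.2 + 0.4 * d$x))
m <- glm(y ~ x, family = poisson, data = d)

phi_hat <- sum(residuals(m, type = "pearson")^2) / df.residual(m)
phi_hat  # close to 1 here, since these data really are Poisson
```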
Negative binomial models
An alternative to the quasi-Poisson is the negative binomial. If \(Y \sim NB(\theta, p)\), then \(Y\) takes values \(y = 0, 1, 2, 3, ...\) with probabilities
\[P(Y = y) = \dfrac{(y + \theta - 1)!}{y!(\theta - 1)!} (1 - p)^\theta p^y\]
- \(\theta > 0\), \(\ \ \ p \in [0, 1]\)
- Mean = \(\dfrac{p \theta}{1 - p} = \mu\)
- Variance = \(\dfrac{p \theta}{(1 - p)^2} = \mu + \dfrac{\mu^2}{\theta}\)
- Variance is a quadratic function of the mean
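We can check this mean-variance relationship by simulation. In R's rnbinom, the size argument plays the role of \(\theta\) and mu is the mean, so the variance should be \(\mu + \mu^2/\theta\):

```r
# Sketch: verifying the negative binomial mean-variance relationship.
set.seed(3)
mu <- 4
theta <- 2
y <- rnbinom(1e5, mu = mu, size = theta)

mean(y)  # should be near mu = 4
var(y)   # should be near mu + mu^2/theta = 4 + 16/2 = 12
```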
An example of a negative binomial regression model for the articles data would be
\[Articles_i \sim NB(\theta, p_i)\]
\[\log(\mu_i) = \beta_0 + \beta_1 Prestige_i + \beta_2 Mentor_i\]
Note that we have changed the distribution of the response, but the systematic component is the same: we assume the log-mean number of articles is a linear function of the explanatory variables.
Fitting and interpreting negative binomial models
We can fit negative binomial models using the glm.nb
function from the MASS package:
library(MASS)##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
m_art_nb <- glm.nb(art ~ phd + ment, data = articles)
summary(m_art_nb)##
## Call:
## glm.nb(formula = art ~ phd + ment, data = articles, init.theta = 2.136178063,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1363 -1.4091 -0.2751 0.3929 3.4907
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.177744 0.115517 1.539 0.124
## phd 0.013150 0.036156 0.364 0.716
## ment 0.030005 0.003237 9.270 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.1362) family taken to be 1)
##
## Null deviance: 1087.1 on 914 degrees of freedom
## Residual deviance: 1002.4 on 912 degrees of freedom
## AIC: 3147.5
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.136
## Std. Err.: 0.247
##
## 2 x log-likelihood: -3139.542
Notice that we do not need to specify a family, because the
glm.nb function is specific to negative binomial models.
Interpretation is very similar to a Poisson; for example, here \(\widehat{\beta}_1 = 0.013\), so a one-unit
increase in PhD prestige is associated with an increase of 0.013 in the
log-mean number of articles published by PhD students, after accounting
for the number of articles published by their mentor. Or, a one-unit
increase in PhD prestige is associated with an increase in the
mean number of articles published by a PhD student by a
factor of \(\exp(0.013) = 1.013\), after accounting for the number of articles published by
their mentor.
Poisson vs. quasi-Poisson vs. negative binomial
Poisson:
- Mean = \(\lambda_i\)
- Variance = \(\lambda_i\) (mean and variance are the same)
quasi-Poisson:
- Mean = \(\lambda_i\)
- Variance = \(\phi \lambda_i\) (variance is linear function of the mean)
negative binomial:
- Mean = \(\mu_i\)
- Variance = \(\mu_i + \dfrac{\mu_i^2}{\theta}\) (variance is a quadratic function of the mean)
A natural question is, how do we choose which of these models to use? Quantile residual plots can be helpful here. Below are four quantile residual plots:
- (a): data come from a Poisson distribution, and we fit a Poisson regression model. Most quantile residuals are between -2 and 2, and the variance of the quantile residuals appears constant, so a Poisson regression model is appropriate.
- (b): here we have overdispersed data (variance is too large for a Poisson), but we fit a Poisson model anyway. Variance is constant, but the spread of the quantile residuals is too large (many residuals above 3), so either a quasi-Poisson or negative binomial model could be used
- (c): Here we have data from a negative binomial distribution, but we fit a Poisson model anyway. There is clear non-constant variance, and the spread of the residuals is too large. A negative binomial model should be used.
- (d): Same data as in (c), but now we fit the (correct) negative binomial model. The plot looks much better!
Steps for choosing a count model:
- First, fit a Poisson model and look at the quantile residual plots. Address any violations of the shape assumption before proceeding
- For the Poisson quantile residual plots, look at the spread of the residuals.
- If the residuals have constant variance and are mostly between -2 and 2 (and almost all between -3 and 3), then a Poisson distribution is reasonable!
- If the residuals have constant variance but the spread is too large, choose either a quasi-Poisson or negative binomial
- If the residuals have non-constant variance, choose a negative binomial
Zero-inflated Poisson (ZIP) models
In this section, we will consider the campus drinking data from class. We have survey data from 77 college students on a dry campus (i.e., alcohol is prohibited) in the US. The survey asks students, “How many alcoholic drinks did you consume last weekend?”
- drinks: the number of drinks the student reports consuming
- sex: an indicator for whether the student identifies as male
- OffCampus: an indicator for whether the student lives off campus
- FirstYear: an indicator for whether the student is a first-year student
Suppose we want to model the number of drinks students consume. This is still a count variable, but let’s look at a histogram:
There is a big spike in the data near 0, because many of the students reported consuming 0 drinks. This is too many zeros for a Poisson distribution, and suggests we have zero-inflated data.
Zero-inflation often occurs when we have different sources of 0s. In this example, students may report 0 drinks because they simply chose not to drink last weekend (but do sometimes drink), or because they never drink. To model the number of drinks consumed, we need a mixture of the drinkers and the non-drinkers.
The ZIP model
Here is a ZIP model for the number of drinks consumed, where the logistic component includes whether or not the student was a first year, and the Poisson component includes whether the student lives off campus, and whether they identify as male:
\[P(Y_i = y) = \begin{cases} e^{-\lambda_i}(1 - \alpha_i) + \alpha_i & y = 0 \\ \dfrac{e^{-\lambda_i} \lambda_i^y}{y!}(1 - \alpha_i) & y > 0 \end{cases}\]
\(\log \left( \dfrac{\alpha_i}{1 - \alpha_i} \right) = \gamma_0 + \gamma_1 FirstYear_i\)
\(\log(\lambda_i) = \beta_0 + \beta_1 OffCampus_i + \beta_2 Male_i\)
Parameters:
- \(\alpha_i = P(Z_i = 1)\), where \(Z_i\) is our latent variable. In this example, \(Z_i = 1\) denotes that the student never drinks, so \(\alpha_i\) is the probability a student is a non-drinker
- \(\lambda_i = \mathbb{E}[Y_i|Z_i = 0]\) (the mean of \(Y_i\) for the group \(Z_i = 0\)). In this example, \(\lambda_i\) is the mean number of drinks consumed by students who sometimes drink.
- The \(\gamma\)s are the coefficients relating \(\alpha_i\) to the explanatory variables
- The \(\beta\)s are the coefficients relating \(\lambda_i\) to the explanatory variables
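The piecewise formula above can be written out directly as an R function, which makes the mixture structure concrete. (dzip is our own name for illustration, not a built-in function.)

```r
# Sketch: the ZIP probability mass function, matching the piecewise
# formula above. alpha = P(structural zero), lambda = Poisson mean
# for the observations with Z_i = 0.
dzip <- function(y, lambda, alpha) {
  ifelse(y == 0,
         alpha + (1 - alpha) * exp(-lambda),
         (1 - alpha) * dpois(y, lambda))
}

dzip(0, lambda = 2, alpha = 0.3)           # extra mass at zero
sum(dzip(0:100, lambda = 2, alpha = 0.3))  # a valid pmf: sums to 1
```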
Fitting and interpreting the model
To fit the model, we use the zeroinfl function from the
pscl package:
library(pscl)
m_zip <- zeroinfl(drinks ~ OffCampus + sex | FirstYear,
data = wdrinks)
summary(m_zip)##
## Call:
## zeroinfl(formula = drinks ~ OffCampus + sex | FirstYear, data = wdrinks)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.1118 -0.8858 -0.5290 0.6367 5.2996
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7543 0.1440 5.238 1.62e-07 ***
## OffCampusTRUE 0.4159 0.2059 2.020 0.0433 *
## sexm 1.0209 0.1752 5.827 5.63e-09 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6036 0.3114 -1.938 0.0526 .
## FirstYearTRUE 1.1364 0.6095 1.864 0.0623 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 8
## Log-likelihood: -140.8 on 5 Df
We interpret the coefficients separately for the logistic and Poisson components. For example, \(\widehat{\gamma}_1 = 1.14\), so the odds of being a non-drinker are higher by a factor of \(\exp\{1.14\} = 3.13\) for first-year students. And \(\widehat{\beta}_1 = 0.42\), so among students who do drink, the average number of drinks consumed is higher by a factor of \(\exp\{0.42\} = 1.52\) for students who live off-campus, holding sex fixed.
Diagnostics for the ZIP model
We can use quantile residual plots to assess the shape assumptions
for the ZIP model, though when violations occur, it is challenging to
identify which component of the model (logistic or Poisson) is violated.
The qresid function in the statmod package
does not support ZIP models, so I have provided my own:
qresid_zeroinfl <- function(zip_m){
y <- zip_m$y
pred_probs <- predict(zip_m, type="prob")
resids <- c()
for(i in 1:length(y)){
cdf_b <- sum(pred_probs[i,1:(y[i]+1)])
if(y[i] == 0){
cdf_a <- 0
} else {
cdf_a <- sum(pred_probs[i,1:y[i]])
}
resids[i] <- qnorm(runif(1, cdf_a, cdf_b))
}
return(resids)
}
Example
Here is an example of a quantile residual plot for the Framingham heart data from class:
m_zip <- zeroinfl(cigsPerDay ~ age + education + diabetes |
age + education + diabetes,
data = heart_data)
heart_data %>%
mutate(resids = qresid_zeroinfl(m_zip)) %>%
ggplot(aes(x = age, y = resids)) +
geom_point() +
geom_smooth() +
labs(x = "Age", y = "Quantile residuals") +
theme_bw()
The shape assumption looks good here! Unfortunately, there is a bit more variance in the quantile residuals than we would like, which suggests a Poisson distribution may not be the best choice. We could use a zero-inflated negative binomial instead…
R structures, loops, and simulation
vectors
A vector is a simple way of storing multiple items. For example, here is a vector containing the numbers 2, 1, 4:
v <- c(2, 1, 4)
v## [1] 2 1 4
The c() in R means “combine” or “concatenate” the
elements of the vector together. The items inside the parentheses are
the elements of the vector.
Length
Every vector has a length, which tells us how many elements it contains:
length(v)## [1] 3
Indexing
We can access each item in the vector by its index. That is, we specify the position of the vector, and get back the element at that position:
v[1]## [1] 2
v[2]## [1] 1
v[3]## [1] 4
We specify the index inside the square brackets [...].
For example, v[2] is the second element of the vector
v.
Vectors can store elements other than numbers. For example, here is a vector containing the letters ‘a’, ‘b’, ‘c’, and ‘d’:
w <- c('a', 'b', 'c', 'd')
w## [1] "a" "b" "c" "d"
We can combine two vectors together with the c()
function:
c(v, w)## [1] "2" "1" "4" "a" "b" "c" "d"
Sequences (seq)
Suppose we want to create a vector containing the numbers 0, 0.1,
0.2,…,0.9, 1. In other words, a sequence of numbers between 0 and 1,
incremented by steps of 0.1. We can use the seq
function:
seq(from=0, to=1, by=0.1)## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Here’s a sequence from 1 to 10, incremented by steps of 1:
seq(from=1, to=10, by=1)## [1] 1 2 3 4 5 6 7 8 9 10
When we want to increment by steps of 1, R has a handy shorthand:
1:10## [1] 1 2 3 4 5 6 7 8 9 10
Repetitions (rep)
Other times, we want to create a vector containing one value repeated
many times. We can use the rep function. For example,
suppose we want to create a vector of eleven 0s:
rep(0, 11)## [1] 0 0 0 0 0 0 0 0 0 0 0
for loops
Suppose we want to repeat a process many times (e.g., simulate many
sets of data). A for loop allows us to do this efficiently.
Here is a simple for loop which prints out the number 214
five times:
for(i in 1:5){
print(214)
}## [1] 214
## [1] 214
## [1] 214
## [1] 214
## [1] 214
for(...) indicates that we are making a for
loop. The code inside the curly braces { ... } is the code
which will get repeated. The index i here tells us how many
times we run this code. Remember that 1:5 is R shorthand
for 1,2,3,4,5. So i in 1:5 means “run the code inside the
loop for i=1, i=2, i=3,
i=4, and i=5”.
The really neat thing about a for loop is that what we
do inside the loop can depend on the index i! For example,
suppose we want to print out the numbers 1 to 5, instead of 214:
for(i in 1:5){
print(i)
}## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
Example: calculating a likelihood
Suppose we have a coin with an unknown probability of heads \(\pi\). Let \(Y_i\) denote the outcome of a flip, with \(Y_i = 0\) for tails and \(Y_i = 1\) for heads. Then \(Y_i \sim Bernoulli(\pi)\).
Now suppose we observe five flips of the coin: T, T, T, T, H (i.e., 0, 0, 0, 0, 1). Given this observed data, the likelihood of a value for \(\pi\) is
\[L(\pi) = \pi(1 - \pi)^4\]
We want to evaluate \(L(\pi)\) for
different values of \(\pi\) between 0
and 1; say, \(\pi = 0, 0.1, 0.2, ..., 0.9,
1\). We can do this calculation efficiently with a
for loop:
pis <- seq(from=0, to=1, by=0.1) # values of pi to consider
likelihoods <- c() # empty vector to store the likelihoods we calculate
for(i in 1:length(pis)){
# likelihood = pi*(1 - pi)^4, for each value of pi
likelihoods[i] <- pis[i]*(1 - pis[i])^4
}
likelihoods## [1] 0.00000 0.06561 0.08192 0.07203 0.05184 0.03125 0.01536 0.00567 0.00128
## [10] 0.00009 0.00000
Note: R does a good job at vectorizing
functions (when you apply a function to a vector, it gets applied to
every entry in that vector). So we could write this code without the
for loop:
pis <- seq(from=0, to=1, by=0.1)
likelihoods <- pis*(1 - pis)^4
likelihoods## [1] 0.00000 0.06561 0.08192 0.07203 0.05184 0.03125 0.01536 0.00567 0.00128
## [10] 0.00009 0.00000
Some useful notation and math
Sum and product notation
At its core, statistics relies on math, and math often uses notation to make equations and expressions simpler. Some notation that we will often encounter in STA 214 is sum (\(\sum\)) and product (\(\prod\)) notation.
Suppose we have \(n\) observations, \(Y_1,...,Y_n\). We want to take the sum of these observations: \[Y_1 + Y_2 + Y_3 + \cdots + Y_n\] Great! But this requires us to write out each of the terms, or use \(\cdots\) to express that there are terms we haven’t written out. Regardless, it is a little clunky. We can make this neater by using \(\sum\) as a shorthand for “sum” (\(\Sigma\) is the capital sigma in Greek, so think “S for Sum”). Using this notation, \[\sum \limits_{i=1}^n Y_i = Y_1 + Y_2 + \cdots + Y_n\] This shorthand means “sum up all the \(Y_i\), starting with \(i = 1\) and going to \(i = n\)”. In other words, sum up \(Y_1,...,Y_n\).
What about products? The product of \(Y_1,...,Y_n\) is \[Y_1 Y_2 \cdots Y_n\] which we can write more cleanly as \(\prod \limits_{i=1}^n Y_i\). Here \(\prod \limits_{i=1}^n\) just means “multiply these \(n\) things together”. (\(\Pi\) is the capital pi in Greek, so think “P for Product”)
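In R, the sum() and prod() functions compute these operations directly on a vector:

```r
# sum() and prod() implement sum and product notation in R.
y <- c(2, 5, 1, 3)
sum(y)   # 2 + 5 + 1 + 3 = 11
prod(y)  # 2 * 5 * 1 * 3 = 30
```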
Logs and properties of logs
Recall that logarithms (logs for short) are what we use to undo exponents. That is, \(\log(x) = y\) means that \(x = e^{y}\). (By default, \(\log\) in math and statistics is the natural log. Equivalent definitions exist for other bases: \(\log_{10}(x) = y\) means that \(x = 10^{y}\)).
There are some useful properties of logs that we use in STA 214, particularly for maximum likelihood estimation:
Logs turn products into sums: \[\log(a \cdot b) = \log(a) + \log(b)\] More generally, \[\log \left( \prod \limits_{i=1}^n a_i \right) = \sum \limits_{i=1}^n \log(a_i)\] Likewise, \[\log(a/b) = \log(a) - \log(b)\]
Logs turn exponents into multiples: \[\log(x^a) = a \log(x)\]
Logs have nice derivatives: \[ \dfrac{d}{dx} \log(x) = \dfrac{1}{x}\]
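We can verify these properties numerically in R (log() is the natural log by default):

```r
# Checking the log properties numerically.
a <- 3
b <- 7
all.equal(log(a * b), log(a) + log(b))  # products become sums
all.equal(log(a / b), log(a) - log(b))  # quotients become differences
all.equal(log(a^5), 5 * log(a))         # exponents become multiples

x <- c(2, 4, 8)
all.equal(log(prod(x)), sum(log(x)))    # product over many terms
```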
Calculus review: differentiation
Recall the following rules for derivative:
- \(\dfrac{d}{dx} c f(x) = c \dfrac{d}{dx} f(x)\) for constant \(c\)
- \(\dfrac{d}{dx} (f(x) + c) = \dfrac{d}{dx} f(x)\) for constant \(c\)
- The derivative of a sum is the sum of derivatives: \[\dfrac{d}{dx} (f(x) + g(x)) = \dfrac{d}{dx} f(x) + \dfrac{d}{dx} g(x)\] More generally, this means that derivatives can move inside \(\sum\): \[\dfrac{d}{dx} \sum \limits_{i=1}^n f_i(x) = \sum \limits_{i=1}^n \dfrac{d}{dx} f_i(x)\]
- Chain rule: \[\dfrac{d}{dx} f(g(x)) = f'(g(x)) g'(x)\]
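These rules are exactly what we need for maximum likelihood estimation. As a worked example, take the coin-flip likelihood \(L(\pi) = \pi(1 - \pi)^4\) from earlier. Taking logs (using the log properties above) and differentiating (using the chain rule):

\[\log L(\pi) = \log(\pi) + 4 \log(1 - \pi)\]

\[\dfrac{d}{d\pi} \log L(\pi) = \dfrac{1}{\pi} - \dfrac{4}{1 - \pi}\]

Setting the derivative equal to 0 and solving gives \(1 - \pi = 4\pi\), so \(\widehat{\pi} = 1/5\). This matches the for loop calculation earlier, where the likelihood was largest at \(\pi = 0.2\).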
Writing math in R Markdown
If you want to write mathematical notation, you need to tell Markdown, “Hey, we’re going to make a math symbol!” To do that, you use dollar signs. For instance, to make \(\widehat{\beta}_1\), you simply put $\widehat{\beta}_1$ into the white space (not a chunk) in your Markdown.
Here are some examples of writing math, which you can adapt:
| Math | Code |
|---|---|
| \(Y_i \sim Bernoulli(\pi_i)\) | $Y_i \sim Bernoulli(\pi_i)$ |
| \(\log \left( \dfrac{\pi_i}{1 - \pi_i} \right)\) | $\log \left( \dfrac{\pi_i}{1 - \pi_i} \right)$ |
| \(\widehat{\pi}_i\) | $\widehat{\pi}_i$ |
| \(\beta_0 + \beta_1 X_i\) | $\beta_0 + \beta_1 X_i$ |
| \(\sum \limits_{i=1}^n\) | $\sum \limits_{i=1}^n$ |
| \(\prod \limits_{i=1}^n\) | $\prod \limits_{i=1}^n$ |
| \(\begin{cases} e^{-\lambda_i}(1 - \alpha_i) + \alpha_i & y = 0 \\ \dfrac{e^{-\lambda_i} \lambda_i^y}{y!}(1 - \alpha_i) & y > 0 \end{cases}\) | $\begin{cases} e^{-\lambda_i}(1 - \alpha_i) + \alpha_i & y = 0 \\ \dfrac{e^{-\lambda_i} \lambda_i^y}{y!}(1 - \alpha_i) & y > 0 \end{cases}$ |
This
work was created by Ciaran Evans and is licensed under a
Creative
Commons Attribution-NonCommercial 4.0 International License. Last
updated 2023 March 31.